Building a Large Annotated Corpus of English: The Penn Treebank

Authors

  • Mitchell P. Marcus
  • Beatrice Santorini
  • Mary Ann Marcinkiewicz
Abstract

There is a growing consensus that significant, rapid progress can be made in both text understanding and spoken language understanding by investigating those phenomena that occur most centrally in naturally occurring unconstrained materials and by attempting to automatically extract information about language from very large corpora. Such corpora are beginning to serve as important research tools for investigators in natural language processing, speech recognition, and integrated spoken language systems, as well as in theoretical linguistics. Annotated corpora promise to be valuable for enterprises as diverse as the automatic construction of statistical models for the grammar of the written and the colloquial spoken language, the development of explicit formal theories of the differing grammars of writing and speech, the investigation of prosodic phenomena in speech, and the evaluation and comparison of the adequacy of parsing models. In this paper, we review our experience with constructing one such large annotated corpus--the Penn Treebank, a corpus consisting of over 4.5 million words of American English. During the first three-year phase of the Penn Treebank Project (1989-1992), this corpus has been annotated for part-of-speech (POS) information. In addition, over half of it has been annotated for skeletal syntactic structure. These materials are available to members of the Linguistic Data Consortium; for details, see Section 5.1. The paper is organized as follows. Section 2 discusses the POS tagging task. After outlining the considerations that informed the design of our POS tagset and presenting the tagset itself, we describe our two-stage tagging process, in which text is first assigned POS tags automatically and then corrected by human annotators. Section 3 briefly presents the results of a comparison between entirely manual and semi-automated tagging, with the latter being shown to be superior on three counts: speed, consistency, and accuracy.
In Section 4, we turn to the bracketing task. Just as with the tagging task, we have partially automated the bracketing task: the output of


Similar resources

Penn Discourse Treebank: Building a Large Scale Annotated Corpus Encoding DLTAG-based Discourse Structure and Discourse Relations

Large scale annotated corpora have played a critical role in speech and natural language research. However, while existing annotated corpora such as the Penn Treebank have been highly successful at the sentence-level, we also need large-scale annotated resources that reliably encode key aspects of discourse. In this paper, we detail (1) our plans for building the Penn Discourse Treebank (PDTB),...

Prague Czech-English Dependency Treebank. Syntactically Annotated Resources for Machine Translation

This paper introduces the Prague Czech-English Dependency Treebank (PCEDT), a new Czech-English parallel resource suitable for experiments in structural machine translation. We describe the process of building the core parts of the resources – a bilingual syntactically annotated corpus and translation dictionaries. A part of the Penn Treebank has been translated into Czech, the dependency annot...

The Penn Arabic Treebank: Building a Large-Scale Annotated Arabic Corpus

From our three-year experience of developing a large-scale corpus of annotated Arabic text, our paper will address the following: (a) review pertinent Arabic language issues as they relate to methodology choices, (b) explain our choice to use the Penn English Treebank style of guidelines (requiring the Arabic-speaking annotators to deal with a new grammatical system) rather than doing the anno...

Grammar-based Corpus Annotation

There is an increasing number of linguists interested in large syntactically annotated corpora (treebanks). Such corpora can serve as a base for statistical applications and, at the same time, may be used in theoretical linguistics as a source for investigations about language use. The most important treebank nowadays is the Penn Treebank (Marcus et al., 1993; Marcus et al., 1994). Many statist...

Extracting a Tree Adjoining Grammar from the Penn Arabic Treebank

Much progress in natural language processing (NLP) over the last decade has come from the combination of using corpora of annotated naturally occurring text along with machine learning algorithms. Following this trend, corpora have been created for other languages, such as the Penn Arabic Treebank (PATB) (Maamouri et al., 2003). However, the corpora almost invariably need to be reinterpreted for the...

Journal title:
  • Computational Linguistics

Volume 19

Publication date: 1993